1 Performance of Firefly RPC
M. Schroeder, M. Burrows 

November 1989 ACM SIGOPS Operating Systems Review, Proceedings of the twelfth
ACM symposium on Operating systems principles, volume 23 issue 5 

Additional Information: full citation, abstract, references, citings, index terms



Full text available: pdf(1.03 MB)



In this paper, we report on the performance of the remote procedure call implementation for 
the Firefly multiprocessor and analyze the implementation to account precisely for all 
measured latency. From the analysis and measurements, we estimate how much faster RPC 
could be if certain improvements were made. The elapsed time for an inter-machine call to a 
remote procedure that accepts no arguments and produces no results is 2.66 milliseconds. 
The elapsed time for an RPC that has a ... 

Compilers I: A performance analysis of the Berkeley UPC compiler
Parry Husbands, Costin Iancu, Katherine Yelick 

June 2003 Proceedings of the 17th annual international conference on 
Supercomputing 

Full text available: pdf(137.75 KB) Additional Information: full citation, abstract, references, index terms

Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data 
(SPMD) model of parallelism within a global address space. The global address space is used 
to simplify programming, especially on applications with irregular data structures that lead 
to fine-grained sharing between threads. Recent results have shown that the performance of 
UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe 
a portable open source compiler for UPC. Ou ... 



Keywords: UPC, global address space, parallel, performance 
January 2001 Australian Computer Science Communications, Proceedings of the 6th
Australasian conference on Computer systems architecture, volume 23 issue

This paper looks at a combination of two techniques, one of which, using a vector instruction
set, has a long history dating back to pipelined vector supercomputers, such as the Cray 1 
and its successors. The other technique, multi-threading, is also well understood. The novel 
approach proposed in this paper combines both vertical and horizontal micro-threading with 
vector instruction descriptors. It will be shown that a family of threads can represent a 
vector instruction with dependencies betw ... 

4 Performance of a micro-threaded pipeline 
Bing Luo, Chris Jesshope 

January 2002 Australian Computer Science Communications, Proceedings of the
seventh Asia-Pacific conference on Computer systems architecture - 
Volume 6, Volume 24 Issue 3 

Full text available: pdf(650.49 KB) Additional Information: full citation, abstract, references, index terms

The micro-threaded microprocessor is a chip multi-processor, which uses a multi-threaded 
approach, where the threads are obtained from within a single context and exploit both 
vector and instruction level parallelism (ILP). This approach employs vertical and horizontal 
transfer in a simple pipeline. The horizontal transfer is referred to as the normal scalar 
pipeline processing used in most microprocessors. Vertical transfer is a context switch, 
which allows the code to tolerate any latency from ... 

Keywords: micro thread, micro-threaded pipeline, performance evaluation, scalar pipeline 
Multithreading and value prediction: Speculative lock elision: enabling highly
concurrent multithreaded 
Ravi Rajwar, James R. Goodman 

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture

Full text available: Publisher Site Additional Information: full citation, abstract, references, citings

Serialization of threads due to critical sections is a fundamental bottleneck to achieving high 
performance in multithreaded programs. Dynamically, such serialization may be 
unnecessary because these critical sections could have safely executed concurrently without 
locks. Current processors cannot fully exploit such parallelism because they do not have 
mechanisms to dynamically detect such false inter-thread dependences. We propose 
Speculative Lock Elision (SLE), a novel micro-architectura ... 

Full-system timing-first simulation 
Carl J. Mauer, Mark D. Hill, David A. Wood 

June 2002 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
2002 ACM SIGMETRICS international conference on Measurement and
modeling of computer systems, volume 30 issue 1

Full text available: pdf(87.83 KB) Additional Information: full citation, abstract, references, citings

Computer system designers often evaluate future design alternatives with detailed 
simulators that strive for functional fidelity (to execute relevant workloads) and performance 
fidelity (to rank design alternatives). Trends toward multi-threaded architectures, more 
complex micro-architectures, and richer workloads, make authoring detailed simulators 
increasingly difficult. To manage simulator complexity, this paper advocates decoupled 
simulator organizations that separate functiona ... 

Slice-processors: an implementation of operation-based prediction 

Andreas Moshovos, Dionisios N. Pnevmatikatos, Amirali Baniasadi 

June 2001 Proceedings of the 15th international conference on Supercomputing

We describe the Slice Processor micro-architecture that implements a generalized
operation-based prefetching mechanism. Operation-based prefetchers predict the series of operations,
or the computation slice that can be used to calculate forthcoming memory references. This 
is in contrast to outcome-based predictors that exploit regularities in the (address) outcome 
stream. Slice processors are a generalization of existing operation-based prefetching 
mechanisms such as stream buffers where the ... 



Thin locks: featherweight synchronization for Java
David F. Bacon, Ravi Konuru, Chet Murthy, Mauricio Serrano 

May 1998 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1998 conference
on Programming language design and implementation, volume 33 issue 5

Full text available: pdf(1.30 MB) Additional Information: full citation, abstract, references, citings, index terms

Language-supported synchronization is a source of serious performance problems in many 
Java programs. Even single-threaded applications may spend up to half their time 
performing useless synchronization due to the thread-safe nature of the Java libraries. We 
solve this performance problem with a new algorithm that allows lock and unlock operations 
to be performed with only a few machine instructions in the most common cases. Our locks 
only require a partial word per object, and were implemented ... 



9 Technical contributions: FORTH for microcomputers 
John S. James 

October 1978 ACM SIGPLAN Notices, volume 13 issue 10

Full text available: pdf(749.78 KB) Additional Information: full citation, abstract, references, citings

Forth is a unique threaded language ideally suited for microcomputers. Programs are 
incredibly compact; e.g. in 5K to 6K bytes you can get the interactive Forth compiler, 
running stand-alone as its own operating system including I/O drivers and other run-time 
routines, plus an assembler written in Forth (in case you want to optimize time-critical 
programs), virtual memory software, and a text editor. Not only does all this fit into 5K to 
6K bytes (of which 4K are written in Forth), but it runs i ... 



Experimental evaluation of the Hewlett-Packard Exemplar file system
Rajesh Bordawekar, Steven Landherr, Don Capps, Mark Davis 

December 1997 ACM SIGMETRICS Performance Evaluation Review, volume 25 issue 3 
Full text available: pdf(793.88 KB) Additional Information: full citation, abstract, index terms

This article presents results from an experimental evaluation study of the HP Exemplar file 
system. Our experiments consist of simple micro-benchmarks that study the impact of 
various factors on the file system performance. These factors include I/O request/buffer 
sizes, vectored/non-vectored access patterns, read-ahead policies, multi-threaded 
(temporally irregular) requests, and architectural issues (cache parameters, NUMA behavior, 
etc.). Experimental results indicate that the Exemplar file s ... 



11 1998: Thin locks: featherweight Synchronization for Java 
David F. Bacon, Ravi Konuru, Chet Murthy, Mauricio J. Serrano 
April 2004 ACM SIGPLAN Notices, volume 39 issue 4 

Full text available: pdf(1.94 MB) Additional Information: full citation, abstract, references

Language-supported synchronization is a source of serious performance problems in many 
Java programs. Even single-threaded applications may spend up to half their time 
performing useless synchronization due to the thread-safe nature of the Java libraries. We 
solve this performance problem with a new algorithm that allows lock and unlock operations 
to be performed with only a few machine instructions in the most common cases. Our locks
only require a partial word per object, and were implemented ... 

12 High-level scheduling model and control synthesis for a broad range of design
applications

Chih-Tung Chen, Kayhan Küçükçakar

November 1997 Proceedings of the 1997 IEEE/ACM international conference on 
Computer-aided design 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

This paper presents a versatile scheduling model and an efficient control synthesis 
methodology which enables architectural (high-level) design/synthesis systems to 
seamlessly support a broad range of architectural design applications from datapath- 
dominated digital signal processing (DSP) to micro-processors/controllers and control- 
dominated peripherals, utilizing multi-phase clocking schemes, multiple threading, data- 
dependent delays, pipelining, and combinations of the above. The work present ... 

Keywords: control synthesis, scheduling model, multi-phase clocking, multi-threading, 
pipelining, relative scheduling, high-level synthesis, architectural synthesis, behavioral 
synthesis, architectural power optimization. 



13 Building a high-performance communication layer over virtual interface architecture on
Linux clusters

Jin-Soo Kim, Kangho Kim, Sung-In Jung 

June 2001 Proceedings of the 15th international conference on Supercomputing 

Full text available: pdf(367.79 KB) Additional Information: full citation, abstract, references, index terms

The Virtual Interface Architecture (VIA) is an industry standard user-level communication 
architecture for cluster or system area networks. The VIA provides a protected, directly- 
accessible interface to a network hardware, removing the operating system from the critical 
communication path. Although the VIA enables low-latency high-bandwidth communication, 
the application programming interface defined in the VIA specification lacks many high-level 
features. 